This version of practical manual includes output of the scripts and answers.

In this part of the course we will look into basic steps of the pathway enrichment analysis. There are many different methods, implementations and tools for performing this.

Some up-to-date web-based enrichment tools are, for example:

Here we will focus on ORA (overrepresentation analysis) and GSEA (Gene Set Enrichment Analysis), as implemented in ClusterProfiler R package (Yu et al. 2012).

1 Preparation

1.1 Install necessary packages

# These packages are from CRAN
# install.packages('knitr')       # HTML reports
# install.packages('kableExtra')  # Simple HTML tables
# install.packages('data.table')  # Fast data reader
# install.packages('ggplot2')     # Plotting framework
# install.packages('dplyr')       # Powerful package for data management

# These packages are from Bioconductor
# source("https://bioconductor.org/biocLite.R")     # Indicate the package repository
# biocLite("clusterProfiler")                       # Package for enrichment tests
# biocLite("org.Hs.eg.db")                          # Human gene annotation package
# biocLite("pathview")                              # For visualization of enriched KEGG pathways
# biocLite("topGO")    # Package for GO enrichment analysis
# biocLite("DOSE")                                  # Package for disease enrichment analysis

1.2 Load necessary packages

library(knitr)
library(kableExtra)
library(data.table)
library(clusterProfiler)
library(pathview)
library(ggplot2)

1.3 Make working directory for this practical

# Create the new directory 
#dir.create("~/PathwayAnalysisPracticalSession/")
# Change the working directoy 
#setwd("~/PathwayAnalysisPracticalSesion/")

1.4 Set seed for the R session

Some of the commands we use in this practical use random processes, such as random permutations. This means that each time the script is ran, the outcome may differ very slightly, just by chance. To make your code reproducible it is good idea to define the number “seed” in the beginning of the script. This way the script gives exactly the same output every time it is ran.

set.seed(123)

2 Prepare the data

2.1 Read in the data from RNA-seq differential analysis

We will read in the results of differential expression analysis from previous practical. We will concentrate on anti-CD3-stimulated dataset in this practical.

cd3 <- fread('/Users/urmovosa/Documents/Teaching/iBMS/practical/RNA_seq_part/DEG_lists/diffExpGenes_cd3_all.csv')

2.2 Prepare data for ORA and GSEA

Next we will do some data cleaning and convert gene IDs into usable format.

# Explore the data
summary(cd3)
##       V1               baseMean         log2FoldChange       lfcSE      
##  Length:23228       Min.   :      0.0   Min.   :-2.920   Min.   :0.000  
##  Class :character   1st Qu.:      0.4   1st Qu.:-0.213   1st Qu.:0.178  
##  Mode  :character   Median :     40.2   Median :-0.005   Median :0.272  
##                     Mean   :   1516.7   Mean   : 0.034   Mean   :0.276  
##                     3rd Qu.:    982.1   3rd Qu.: 0.258   3rd Qu.:0.366  
##                     Max.   :1388488.9   Max.   : 3.569   Max.   :0.476  
##                                         NA's   :3287     NA's   :3287   
##       stat            pvalue           padj      
##  Min.   :-8.123   Min.   :0.000   Min.   :0.000  
##  1st Qu.:-0.923   1st Qu.:0.054   1st Qu.:0.124  
##  Median :-0.017   Median :0.321   Median :0.440  
##  Mean   : 0.064   Mean   :0.397   Mean   :0.451  
##  3rd Qu.: 1.059   3rd Qu.:0.727   3rd Qu.:0.757  
##  Max.   : 8.822   Max.   :1.000   Max.   :1.000  
##  NA's   :3287     NA's   :3287    NA's   :6744
# There are some log2FCs, lfcSE marked as "NAs". What is the reason?
head(cd3[is.na(cd3$log2FoldChange), ])
##         V1 baseMean log2FoldChange lfcSE stat pvalue padj
## 1:   A4GNT        0             NA    NA   NA     NA   NA
## 2:    AA06        0             NA    NA   NA     NA   NA
## 3:  AACSP1        0             NA    NA   NA     NA   NA
## 4:   AADAC        0             NA    NA   NA     NA   NA
## 5: AADACL3        0             NA    NA   NA     NA   NA
## 6: AADACL4        0             NA    NA   NA     NA   NA
# How many such genes?
table(cd3[is.na(cd3$log2FoldChange), ]$baseMean == 0)
## 
## TRUE 
## 3287

We will remove the genes with mean base expression level 0, as those were not expressed in our data.

# Data cleaning, remove unexpressed genes (baseMean = 0)
cd3 <- cd3[!cd3$baseMean == 0, ]

# convert HGNC gene symbols to more stable ENTREZ IDs
entrez <- bitr(cd3$V1, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db")

# Convert ENTREZ to text, so that R would process it correctly
entrez$ENTREZID <- as.character(entrez$ENTREZID)

As you may see from the R message, ~10% of HGNC symbols were not connected with any ENTREZ ID. This is due to the unstable nature of HGNC annotation. In real-life science project you should opt for more stable annotation scheme (e.g. use one ENSEMBL/UCSC annotation database version throughout project).

For the sake of brevity and simplicity, in this practical we remove unlinked gene IDs from following steps and proceed with remaining 90% of genes.

# Add column with ENTREZ IDs to the table, remove rows where ENTREZ was missing
cd3 <- merge(cd3, entrez, by.x = 'V1', by.y = 'SYMBOL')

We will prepare the data for ORA. In ORA we will use genes which are significantly differentically expressed (FDR<0.05, absolute fold-change>2). Fold change is often used as additional filter for identifying most relevant genes for interpretation.

# Prepare the data for overrepresentation tests
# Filter in significant (adjusted P<0.05) results
# Filter in only genes with larger than than 2-fold expression change 
cd3_sig <- cd3[cd3$padj < 0.05 & abs(cd3$log2FoldChange) > 1, ]

Next we will prepare data for GSEA analysis. For that we need to order all genes based on their correlation strenght with the phenotype or effect size. In our case we will order those based on \(log_2(FC)\). We will also extract the vector of all tested genes which can be used as “gene universe”.

# Order gene table by effect size (log2 fold-change), from largest to smallest
cd3 <- cd3[order(cd3$log2FoldChange, decreasing = T), ]

# Look at the range of fold changes
#png('fold_changes.png', units = 'in', height = 7, width = 7)
plot(cd3$log2FoldChange, xlab = 'Gene', ylab = 'log2(FC)')

#dev.off()

# Prepare data for GSEA (vector of log2(FC)'s, named by ENTREZ IDs)
cd3_gsea <- cd3$log2FoldChange
names(cd3_gsea) <- cd3$ENTREZID
# This vector of gene names corresponds to all genes we tested in the analysis and will be used later as "gene universe"

3 Over-representation analyses (ORA)

3.1 Run KEGG over-representation analysis

In this part we run enrichment test (hypergeometric test) for all differentially expressed genes (FDR<0.05, FC>2).

  • We use all genes tested in the RNA-profiling study as a background set or “gene universe”. This list was already constructed in the previous step.

  • We will query for all enrichment results and write out 25 most differentially expressed, not accounting the significance.

  • The default multiple testing is done by Benjamini-Hochberg method, flag pvalueCutoff can be used for filtering only significant results (<0.05). Additionally, qvalueCutoff flag is used to filter results based on another popular method of FDR estimation (Storey q-value).

  • The defaults of the command also apply restrictions on the sizes of gene sets tested in the analysis, those can be seen with command ?enrichKEGG.

KEGG_all <- enrichKEGG(gene = cd3_sig$ENTREZID,
                 organism = 'hsa',
                 universe = cd3$ENTREZID,
                 pvalueCutoff = 1, 
                 qvalueCutoff = 1)

# this command converts ENTREZ IDs back to HGNC symbols to ease interpretation
KEGG_all <- setReadable(KEGG_all, OrgDb = 'org.Hs.eg.db', keytype="ENTREZID")

We will accustom ourselves with the data structure of the analysis output.

# show the data structure
# str(KEGG_all)

The resulting object is more complex R S4 data structure which consists separate “containers” for different results, analysis settings, etc. You can access to those containers by using @.

# look at some slots
head(KEGG_all@result, 5)  # result table
##                ID                            Description GeneRatio
## hsa04060 hsa04060 Cytokine-cytokine receptor interaction    56/423
## hsa04640 hsa04640             Hematopoietic cell lineage    24/423
## hsa05321 hsa05321       Inflammatory bowel disease (IBD)    18/423
## hsa05310 hsa05310                                 Asthma    11/423
## hsa04110 hsa04110                             Cell cycle    24/423
##           BgRatio       pvalue     p.adjust       qvalue
## hsa04060 253/6282 3.309626e-16 9.300048e-14 8.326321e-14
## hsa04640  93/6282 5.627749e-09 7.906987e-07 7.079116e-07
## hsa05321  63/6282 8.685665e-08 8.135573e-06 7.283768e-06
## hsa05310  27/6282 5.523284e-07 3.880107e-05 3.473855e-05
## hsa04110 124/6282 1.889612e-06 1.061962e-04 9.507733e-05
##                                                                                                                                                                                                                                                                                                                                          geneID
## hsa04060 BMPR2/CCL1/CCL24/CCL3/CCL4/CCR1/CCR9/CD27/CD70/CSF1R/CSF2/CX3CR1/CXCL10/CXCL13/EDAR/IFNG/IL10/IL12RB2/IL13/IL13RA1/IL17A/IL17F/IL18RAP/IL19/IL1R2/IL21/IL22/IL23R/IL26/IL2RA/IL3/IL4/IL5/IL7/IL7R/IL9/IL9R/KDR/LEPR/LIF/LTA/OSM/PDGFA/PDGFB/PDGFC/TGFB2/TGFBR2/TNFRSF10D/TNFRSF11B/TNFRSF13B/TNFRSF18/TNFRSF4/TNFRSF8/TNFSF9/XCL1/XCL2
## hsa04640                                                                                                                                                                                                   CD33/CD36/CD38/CD9/CR1/CR2/CSF1R/CSF2/FCER2/HLA-DMB/HLA-DOA/HLA-DQA1/HLA-DQB1/IL1R2/IL2RA/IL3/IL4/IL5/IL7/IL7R/IL9R/ITGA5/ITGAM/TFRC
## hsa05321                                                                                                                                                                                                                         HLA-DMB/HLA-DOA/HLA-DQA1/HLA-DQB1/IFNG/IL10/IL12RB2/IL13/IL17A/IL17F/IL18RAP/IL21/IL22/IL23R/IL4/IL5/MAF/TGFB2
## hsa05310                                                                                                                                                                                                                                                                       HLA-DMB/HLA-DOA/HLA-DQA1/HLA-DQB1/IL10/IL13/IL3/IL4/IL5/IL9/PRG2
## hsa04110                                                                                                                                                                                             BUB1/BUB1B/CCNB1/CCNB2/CCND2/CCNE1/CDC20/CDC25A/CDC25C/CDC45/CDC6/CDKN2A/CDKN2B/CHEK1/E2F1/ESPL1/MAD2L1/MCM2/MCM4/ORC1/ORC6/TGFB2/TTK/WEE1
##          Count
## hsa04060    56
## hsa04640    24
## hsa05321    18
## hsa05310    11
## hsa04110    24
KEGG_all@ontology         # what ontology was tested
## [1] "KEGG"
KEGG_all@pvalueCutoff     # what P-value cutoff was used
## [1] 1

Now we look at the results by printing out 25 first rows of the result table. View function in RStudio makes it very comfortable.

# remove 8. row which is long and uninformative
kable(head(KEGG_all@result[, -8], 25))
ID Description GeneRatio BgRatio pvalue p.adjust qvalue Count
hsa04060 hsa04060 Cytokine-cytokine receptor interaction 56/423 253/6282 0.0000000 0.0000000 0.0000000 56
hsa04640 hsa04640 Hematopoietic cell lineage 24/423 93/6282 0.0000000 0.0000008 0.0000007 24
hsa05321 hsa05321 Inflammatory bowel disease (IBD) 18/423 63/6282 0.0000001 0.0000081 0.0000073 18
hsa05310 hsa05310 Asthma 11/423 27/6282 0.0000006 0.0000388 0.0000347 11
hsa04110 hsa04110 Cell cycle 24/423 124/6282 0.0000019 0.0001062 0.0000951 24
hsa04630 hsa04630 Jak-STAT signaling pathway 26/423 147/6282 0.0000041 0.0001940 0.0001737 26
hsa05330 hsa05330 Allograft rejection 11/423 35/6282 0.0000108 0.0004332 0.0003878 11
hsa05140 hsa05140 Leishmaniasis 15/423 71/6282 0.0000573 0.0020110 0.0018005 15
hsa04658 hsa04658 Th1 and Th2 cell differentiation 17/423 90/6282 0.0000837 0.0026146 0.0023408 17
hsa04940 hsa04940 Type I diabetes mellitus 10/423 39/6282 0.0001861 0.0052288 0.0046814 10
hsa03460 hsa03460 Fanconi anemia pathway 11/423 47/6282 0.0002155 0.0055063 0.0049298 11
hsa04672 hsa04672 Intestinal immune network for IgA production 10/423 45/6282 0.0006484 0.0151835 0.0135937 10
hsa05320 hsa05320 Autoimmune thyroid disease 10/423 48/6282 0.0011084 0.0239576 0.0214492 10
hsa04659 hsa04659 Th17 cell differentiation 16/423 105/6282 0.0016131 0.0316147 0.0283046 16
hsa04610 hsa04610 Complement and coagulation cascades 12/423 68/6282 0.0017079 0.0316147 0.0283046 12
hsa04380 hsa04380 Osteoclast differentiation 18/423 126/6282 0.0018001 0.0316147 0.0283046 18
hsa05332 hsa05332 Graft-versus-host disease 8/423 37/6282 0.0026938 0.0445276 0.0398655 8
hsa05150 hsa05150 Staphylococcus aureus infection 9/423 46/6282 0.0030767 0.0480301 0.0430013 9
hsa04145 hsa04145 Phagosome 19/423 144/6282 0.0034398 0.0508721 0.0455458 19
hsa04151 hsa04151 PI3K-Akt signaling pathway 35/423 326/6282 0.0036621 0.0514532 0.0460660 35
hsa01521 hsa01521 EGFR tyrosine kinase inhibitor resistance 12/423 79/6282 0.0061392 0.0821487 0.0735477 12
hsa04350 hsa04350 TGF-beta signaling pathway 12/423 82/6282 0.0082770 0.1057194 0.0946505 12
hsa05166 hsa05166 HTLV-I infection 27/423 249/6282 0.0089486 0.1093283 0.0978815 27
hsa00380 hsa00380 Tryptophan metabolism 7/423 37/6282 0.0105330 0.1233244 0.1104122 7
hsa04657 hsa04657 IL-17 signaling pathway 12/423 86/6282 0.0119855 0.1347165 0.1206115 12

Q1: How many KEGG pathways are significant if we consider Benjamini-Hochberg P<0.05?

A: 18

Q2: How many KEGG pathways are significant if we consider Storey Q<0.05?

A: 20

Q3: Authors of the original paper (Quinn et al. 2015, link) have already applied the KEGG overrepresentation test to the same data. Find the results from the publication and compare to your results- are those similar? If not exactly, speculate, what could be the reason(s) of observed differences?

A: Quite similar. Possible sources of difference:

-Possibly different ORA method.

-Possibly different database version.

-Possibly some difference in preprocessing of the data (e.g. normalization).

-Most probably different defaults of ORA (e.g. gene set size 10-500).

Q4: Use Fisher’s exact test to test the over-representation of the fifth most significant gene set enrichment result.

  • Google “Fisher’s Exact test R”

  • Construct the contigency table based on GeneRatio and BgRatio columns in result table- you can use this code snippet to replace “NAs” with the correct numbers from the result table:

GeneMatrixFisher <-
       matrix(c(NA, NA, NA, NA),
       nrow = 2,
       dimnames = list(c("Diff_expressed", "Not_diff_expressed"),
       c("Part_of_pathway", "Not_part_of_pathway")))
  • Tip: you can use command addmargins on your GeneMatrixFisher to see if your contigency table adds up.

  • Use the GeneMatrixFisher to run Fisher’s Exact Test.

Q4.1: Report the used command, Fisher’s exact test P-value and odds ratio.

A:

Pathway:

Cell cycle

Contingency table:

GeneMatrixFisher <-
       matrix(c(24, 124 - 24, 423 - 24, 6280 - 423 - (124 - 24)),
       nrow = 2,
       dimnames = list(c("Diff_expressed", "Not_diff_expressed"),
       c("Part_of_pathway", "Not_part_of_pathway")))

GeneMatrixFisher
##                    Part_of_pathway Not_part_of_pathway
## Diff_expressed                  24                 399
## Not_diff_expressed             100                5757

Command:

fisher.test(GeneMatrixFisher, alternative = 'greater')

Output:

fisher.test(GeneMatrixFisher, alternative = 'greater')
## 
##  Fisher's Exact Test for Count Data
## 
## data:  GeneMatrixFisher
## p-value = 1.9e-06
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  2.270164      Inf
## sample estimates:
## odds ratio 
##   3.461707

Q4.2: Does Fisher’s exact test give similar/same results as ClusterProfiler?

A: When considering rounding, yes. Hypergeometric test is actually identical with one-sided Fisher’s Exact Test.

# Fisher's Exact Test:

GeneMatrixFisher <-
       matrix(c(32, (2050 - 32), (45-32), (6194 - 2050 - 13)),
       nrow = 2,
       dimnames = list(c("Part_of_pathway", "Not_part_of_pathway"), 
                       c("Diff_exp", "Not_diff_exp")))


fisher.test(GeneMatrixFisher, alternative = 'greater')

3.1.1 Visualize the overall KEGG enrichment analysis results

We can visualize our results on a barplot. Length of the bar shows significance (\(-log_{10}(P)\)) and the color shows number of genes overlapping with gene set.

input_barplot <- KEGG_all@result[, -8]
input_barplot$Description <- factor(input_barplot$Description, levels = rev(as.character(input_barplot$Description)))
# here we apply the default significance thresholds (Benjamini-Hochberg P<0.05 and Storey Q-value<0.2)
input_barplot <- input_barplot[input_barplot$p.adjust < 0.05 & input_barplot$qvalue < 0.2, ]

ggplot(input_barplot, aes(x = Description, y = -log10(pvalue), fill = Count)) + geom_bar(stat = 'identity') + 
  theme_classic() + 
  coord_flip() + scale_fill_continuous(low = 'lightblue', high = 'salmon')

Next we visualize five most enriched pathways on the network graph. Size of the pathway node shows statistical significance and links indicate gene membership in the pathway. This gives the static graph as a output. If it is necessary to investigate further, you can use flag fixed = FALSE to see interactive version.

cnetplot(KEGG_all, categorySize = "pvalue", showCategory = 5, foldChange = cd3_gsea, cex = 0.1)

# Interactive version
# cnetplot(KEGG_all, categorySize = "pvalue", showCategory = 5, foldChange = cd3_gsea, cex = 0.1, fixed = F)

Q5: What are the most up- and down-regulated genes from 5 most enriched pathways, based on visual inspection?

A: Up-regulated: IFNG, down-regulated: KDR.

Q6: Is there any of the top pathways showing coordinated up- or down-regulation for the majority (or all) of its members?

A: Yes, “Cell cycle”, only one differentially expressed gene is down-regulated.

3.1.1.1 Visualize the most significantly enriched KEGG pathway

For KEGG pathways it is possible to visualize the location of differentially expressed genes on the pathway, as well as their magnitude of differential expression. This kind of visualization might help to identify most relevant genes for given condition and generate new research hypotheses.

pathview(gene.data = cd3_gsea,
         pathway.id = KEGG_all@result[1, ]$ID,
         species = "hsa",
         limit = list(gene = max(abs(cd3_gsea)), cpd = 1), 
         out.suffix = "most_sig_pathway")

For adding the plot to the HTML document: in the following command, replace the file name with the name of the file which was written to your work directory.

![]([YourPathName].pathview.png)

Q7: Same figure is also reported in the original publication (Quinn et al. 2015, link). Does it look similar or do you see any difference? Speculate, where can it come from?

A: Results are in quite good concordance with published results. There is some minor difference between two sets of results- this could be caused by the same reasons as explained in Q3.

3.2 Run Gene Ontology over-representation analysis

Next we run the overrepresentation test for Gene Ontologies.

  • For the sake of brevity, we test only the ontologies from Biological Process category. By setting the ont flag, is also possible to test Molecular Function (MF), Cellular Component (CC) or all three categories combined (ALL).

  • For visualization purposes we apply this time significance thresholds (Benjamini-Hochberg FDR<0.05 and Storey FDR<0.05).

  • We specify that we test the enrichment for the gene sets with sizes between 10-10000 genes (flags minGSSize and maxGSSize)

GO_all <- enrichGO(gene = cd3_sig$ENTREZID,
                   universe = cd3$ENTREZID,
                   OrgDb = 'org.Hs.eg.db',
                   ont = 'BP',
                   pvalueCutoff = 0.05, 
                   qvalueCutoff = 0.05,
                   minGSSize = 10, 
                   maxGSSize = 10000)

# remove 8. row which is long and uninformative
kable(head(GO_all@result[, -8], 15))
ID Description GeneRatio BgRatio pvalue p.adjust qvalue Count
GO:0008283 GO:0008283 cell proliferation 192/836 1756/14300 0 0 0 192
GO:0002376 GO:0002376 immune system process 242/836 2431/14300 0 0 0 242
GO:0006950 GO:0006950 response to stress 292/836 3271/14300 0 0 0 292
GO:0006955 GO:0006955 immune response 177/836 1681/14300 0 0 0 177
GO:0007166 GO:0007166 cell surface receptor signaling pathway 230/836 2405/14300 0 0 0 230
GO:0006952 GO:0006952 defense response 144/836 1305/14300 0 0 0 144
GO:0042127 GO:0042127 regulation of cell proliferation 148/836 1375/14300 0 0 0 148
GO:0006260 GO:0006260 DNA replication 51/836 270/14300 0 0 0 51
GO:0050896 GO:0050896 response to stimulus 522/836 7155/14300 0 0 0 522
GO:0006954 GO:0006954 inflammatory response 81/836 589/14300 0 0 0 81
GO:0044763 GO:0044763 single-organism cellular process 639/836 9331/14300 0 0 0 639
GO:0051716 GO:0051716 cellular response to stimulus 444/836 5880/14300 0 0 0 444
GO:0051240 GO:0051240 positive regulation of multicellular organismal process 138/836 1314/14300 0 0 0 138
GO:1903047 GO:1903047 mitotic cell cycle process 100/836 841/14300 0 0 0 100
GO:0034097 GO:0034097 response to cytokine 92/836 768/14300 0 0 0 92

3.2.1 Visualize the GO enrichment analysis results

To ease the interpretation and see the hierarchical relationships between most enriched GO terms, we visualize those on graph structure. Square boxes are significantly enriched terms and color depicts the significance (P-value). We will visualize 15 most significant terms.

#png('GO_plot.png', height = 10, width = 10, units = 'in', res = 400)

plotGOgraph(GO_all, firstSigNodes = 15)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 53 
## Number of Edges = 86 
## 
## $complete.dag
## [1] "A graph with 53 nodes."
#dev.off()

Q8: What is the most significant GO term?

A: Cell proliferation

Q9: What are the main general themes coming out of the analysis?

A: Look at the hierarchical structure indicates mainly immune-related processes and cell proliferation.

3.3 Run Disease Ontology over-representation analysis

Clusterprofiler allows to do also ORA for human diseases, making use of several databases of curated gene-disease relationships. As our trait of interest is Coeliac disease, it would be interesting to see differentially expressed genes also show enrichment of some human diseases and whether those are could be related with Coeliac disease or any other autoimmune disease.

Q10: If you see that the genes differentially expressed in your disease-of-interest (in our case Coeliac disease) are enriched by genes known to be associated with some other disease (e.g. inflammatory bowel disease, another autoimmune disease), what is the biological conclusion you make? Could this observation be useful for the search of the treatment and how would you proceed with that?

A: This could mean that there might be some similar mechanisms playing role in the pathogenesis of those two diseases. You can imagine that if there is known treatment for disease which comes significant, it might point you to leads for new potential drugs/treatments.

We will run DO enrichment analysis by asking all the results (no filtering based on FDR) and otherwise with default settings (minGSSize = 10, maxGSSize = 500).

DO_all <- enrichDO(gene = cd3_sig$ENTREZID,
                   universe = cd3$ENTREZID,
                   pvalueCutoff = 1, 
                   qvalueCutoff = 1)

DO_all <- setReadable(DO_all, OrgDb = 'org.Hs.eg.db', keytype = "ENTREZID")

# remove 8. column which is long and uninformative
kable(head(DO_all@result[, -8], 15))
ID Description GeneRatio BgRatio pvalue p.adjust qvalue Count
DOID:850 DOID:850 lung disease 77/524 458/6862 0e+00 0.00e+00 0.00e+00 77
DOID:0050161 DOID:0050161 lower respiratory tract disease 78/524 467/6862 0e+00 0.00e+00 0.00e+00 78
DOID:74 DOID:74 hematopoietic system disease 71/524 434/6862 0e+00 1.00e-07 0.00e+00 71
DOID:1176 DOID:1176 bronchial disease 33/524 137/6862 0e+00 3.00e-07 2.00e-07 33
DOID:2841 DOID:2841 asthma 30/524 119/6862 0e+00 4.00e-07 3.00e-07 30
DOID:2320 DOID:2320 obstructive lung disease 50/524 282/6862 0e+00 1.20e-06 8.00e-07 50
DOID:2723 DOID:2723 dermatitis 34/524 166/6862 1e-07 7.70e-06 5.00e-06 34
DOID:37 DOID:37 skin disease 51/524 317/6862 2e-07 1.73e-05 1.12e-05 51
DOID:3905 DOID:3905 lung carcinoma 67/524 470/6862 2e-07 1.85e-05 1.20e-05 67
DOID:16 DOID:16 integumentary system disease 55/524 357/6862 3e-07 1.85e-05 1.20e-05 55
DOID:3388 DOID:3388 periodontal disease 27/524 121/6862 3e-07 1.85e-05 1.20e-05 27
DOID:3310 DOID:3310 atopic dermatitis 28/524 130/6862 4e-07 2.23e-05 1.44e-05 28
DOID:824 DOID:824 periodontitis 24/524 104/6862 6e-07 3.79e-05 2.45e-05 24
DOID:104 DOID:104 bacterial infectious disease 41/524 243/6862 9e-07 4.75e-05 3.08e-05 41
DOID:5082 DOID:5082 liver cirrhosis 36/524 201/6862 1e-06 4.92e-05 3.18e-05 36

Q11: Does resulting significant disease list make sense in relation to Coeliac disease? Are there any diseases which could show similar expression profile as Coeliac disease? If not, speculate, what could be the reason of ambigous results?

A: Not really. It is not easy to interpret e.g. lung diseases in relation with Coeliac disease. Maybe “asthma” and “atopic dermatitis” might be in line as those are also immune-related diseases. However, you may see that most significant results are indicating quite general diseases (e.g. “lung disease”, “skin disease”, etc.) with >100 genes linked to the disease. This is due to hierchical nature of disease ontology, where there are also very broad “diseases” combining many potentially looseli related pathologies. This gives also quite broad enrichment results, as there are actually few “narrow-defined” diseases with so large number of known disease-related genes. It might be more informative to enforce upper limit for the size of tested gene sets- you can experiment with maxGSSize. Hint: Setting maxGSSize e.g. to 100, gives much more easily interpretable results, with immune-related diseases in top. However, you should always reason the choice of analysis settings based on analytical consideration (ideally before running the analysis), not based on which setting gives you the most appealing results!

4 Gene set enrichment analysis (GSEA)

Next we will use GSEA (Subramanian et al. 2005) on the diffrential analysis results. Unlike ORA, GSEA uses all the results from differential expression analysis, not only significant ones.

4.1 Run GSEA (Gene Set Enrichment Analysis) for KEGG pathways

For the sake of speed and brevity, we use recommended minimal number of 1000 permutations for this analysis (to get more stable and precise results your could increase the number of permutations to e.g. 10000 in real scientific work). Also, we test only KEGG pathways which have more than 50 known gene members.

KEGG_GSEA <- gseKEGG(geneList = cd3_gsea,
                         organism = 'hsa',
                         nPerm = 1000,
                         minGSSize = 50,
                         pvalueCutoff = 1,
                         verbose = FALSE, seed = T)

kable(head(KEGG_GSEA@result[, -c(10:12)], 15))
ID Description setSize enrichmentScore NES pvalue p.adjust qvalues rank
hsa04060 hsa04060 Cytokine-cytokine receptor interaction 253 0.4542574 1.780635 0.0012422 0.0313647 0.0244104 1526
hsa04630 hsa04630 Jak-STAT signaling pathway 147 0.5254741 1.947766 0.0013643 0.0313647 0.0244104 1429
hsa04114 hsa04114 Oocyte meiosis 110 0.4603085 1.637844 0.0013947 0.0313647 0.0244104 3610
hsa00240 hsa00240 Pyrimidine metabolism 95 0.5396680 1.895844 0.0014164 0.0313647 0.0244104 2716
hsa04110 hsa04110 Cell cycle 124 0.5998859 2.158882 0.0014164 0.0313647 0.0244104 2786
hsa01200 hsa01200 Carbon metabolism 108 0.5164693 1.828463 0.0014184 0.0313647 0.0244104 3952
hsa04657 hsa04657 IL-17 signaling pathway 86 0.5680668 1.962633 0.0014409 0.0313647 0.0244104 2661
hsa05132 hsa05132 Salmonella infection 80 0.5272237 1.799571 0.0014535 0.0313647 0.0244104 2189
hsa01230 hsa01230 Biosynthesis of amino acids 66 0.5227021 1.724760 0.0015015 0.0313647 0.0244104 3968
hsa05162 hsa05162 Measles 129 0.4302471 1.559420 0.0028011 0.0418149 0.0325435 2151
hsa05217 hsa05217 Basal cell carcinoma 61 -0.4873583 -1.749591 0.0028902 0.0418149 0.0325435 2714
hsa05321 hsa05321 Inflammatory bowel disease (IBD) 63 0.5091961 1.664157 0.0030441 0.0418149 0.0325435 866
hsa04380 hsa04380 Osteoclast differentiation 126 -0.4338584 -1.761521 0.0034364 0.0418149 0.0325435 1088
hsa03010 hsa03010 Ribosome 130 -0.4784401 -1.961815 0.0034483 0.0418149 0.0325435 3535
hsa04514 hsa04514 Cell adhesion molecules (CAMs) 131 -0.3546720 -1.452435 0.0035211 0.0418149 0.0325435 3501

Q10: How many pathways are significant if we consider Benjamini-Hochberg FDR?

A: 11

Q11: Are the results in line with ORA results?

A: Generally yes.

4.1.1 Visualize GSEA plot for most significant KEGG pathway

print(head(KEGG_GSEA@result, 1)$Description)
## [1] "Cytokine-cytokine receptor interaction"
gseaplot(KEGG_GSEA, geneSetID = head(KEGG_GSEA@result, 1)$ID)

5 Compare the biological themes of different stimulations

It is possible to easily compare the results several ORA analyses, using the functionality from compareCluster results. Next we read in and preprocess differentially expressed genes from all three conditions (UN-CeD vs control, unstimulated; CD3-CeD vs control, anti-CD3 stimulation; PMA-CeD vs control, PMA stimulation). We will run the KEGG ORA, using default settings and all genes tested in the study as an background.

# Read in and preprocess all different conditions:

# Unstimulated
unstim <- fread('/Users/urmovosa/Documents/Teaching/iBMS/practical/RNA_seq_part/DEG_lists/diffExpGenes_unstim_all.csv')

unstim <- unstim[!unstim$baseMean == 0, ]

entrez <- bitr(unstim$V1, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db")
unstim <- merge(unstim, entrez, by.x = 'V1', by.y = 'SYMBOL')
unstim_overrep <- unstim[unstim$padj < 0.05 & abs(unstim$log2FoldChange) > 1, ]

# CD3
cd3 <- fread('/Users/urmovosa/Documents/Teaching/iBMS/practical/RNA_seq_part/DEG_lists/diffExpGenes_cd3_all.csv')

cd3 <- cd3[!cd3$baseMean == 0, ]

entrez <- bitr(cd3$V1, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db")
cd3 <- merge(cd3, entrez, by.x = 'V1', by.y = 'SYMBOL')
cd3_overrep <- cd3[cd3$padj < 0.05 & abs(cd3$log2FoldChange) > 1, ]

# PMA
pma <- fread('/Users/urmovosa/Documents/Teaching/iBMS/practical/RNA_seq_part/DEG_lists/diffExpGenes_pma_all.csv')

pma <- pma[!pma$baseMean == 0, ]

entrez <- bitr(pma$V1, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db")
pma <- merge(pma, entrez, by.x = 'V1', by.y = 'SYMBOL')
pma_overrep <- pma[pma$padj < 0.05 & abs(pma$log2FoldChange) > 1, ]

input_comparison <- list(UNST = unstim_overrep$ENTREZID,
                         CD3 = cd3_overrep$ENTREZID,
                         PMA = pma_overrep$ENTREZID)

ck <- compareCluster(geneCluster = input_comparison, fun = "enrichKEGG",
                     universe = unstim$ENTREZID, 
                     pvalueCutoff = 1,
                     qvalueCutoff = 1)

We use dplyr and ggplot2 functionality to compare the 20 most significant pathways for each group. Here, the size of the dot indicates significance (\(-log_{10}(P)\)) and red dots indicate pathways withstanding multiple testing correction.

library(dplyr) #  we load this package only now because of apparent conflict with "pathview" package

# Next command does the following:
  # groups data based on stimulation (Cluster)
  # from each stimulation takes out 20 genes with smallest P-values
  # arranges all genes by P-value for visualization
  # reorders factor levels of "Description", so that ggplot2 could use the correct order
  # constructs variable "significance" to outline sets which are Benjamini-Hochberg FDR<0.05
  # uses ggplot to visualize the plot

compare_input <- ck@compareClusterResult

compare_input %>% 
  group_by(Cluster) %>%
  top_n(20, desc(pvalue)) %>%
  arrange(pvalue) %>%
  mutate(indic = row_number()) %>%
  ungroup() %>%
  mutate(Description = factor(Description, levels = rev(unique(as.character(Description))))) %>%
  mutate(significance = case_when(p.adjust < 0.05 ~ 'sig.',
                                  p.adjust >= 0.05 ~ 'not sig.')) %>%
  ggplot(aes(y = Description, x = Cluster, colour = significance, size = -log10(pvalue))) + 
  geom_point(stat = 'identity') + 
  theme_bw() + 
  scale_colour_manual(values = c('black', 'red'))

Q11: Are there any KEGG pathways overlapping between any of the conditions?

A: Yes, “Cytokine-cytokine receptor interaction”, “Inflammatory bowel disease”, between two stimulations.

Q12: What condition has the largest number of significantly enriched pathways?

A: Anti-CD3 stimulation.

Q13: Are the per-cohort results in line with the ORA results reported in the paper?

A: Generally yes, but there are also some differences.

6 Report the session information

Finally, when wrapping up your analysis, it is always a good idea to save the settings you have used. This involves software versions and the date of the analysis, making it possible to replicate or debug the results by different person.

Command sink is useful for saving any output from the screen into file.

Investigate the resulting file and note what information is saved.

sink("AnalysisInfo.txt")  # Open the connection to file.
sessionInfo()             # Prints information about R session.
Sys.time()                # Prints the information about analysis time stamp.
sink()                    # Close the connection to file.

7 Bonus Exercise

Next advanced tasks are meant to quickest students who have already finalized previous analyses.

ClusterProfiler has a very useful functionality to allow performing ORA and GSEA for any gene set database. This means that you can define the gene set yourself or use some custom gene set database which is not explicitly implemented to ClusterProfiler.

We will try to use this functionality on anti-CD3 treated Coeliac disease differential expression results.

Task 1 Investigate ClusterProfiler manual here to identify which commands allow you to perform universal ORA and GSEA. Note what extra file(s) are needed to do so.

Task 2 There are numerous gene sets available and downloadable on the web tool enrichr website (http://amp.pharm.mssm.edu/Enrichr/). Locate the downloadable databases from the web site, investigate what kind of gene sets are available and select one you would like to test in your data. Also, think what gene set dataset gives interesting information about differentially expressed genes in Coeliac disase. Download the desirable file.

NB! Some of the data files might not work- then just make another choice for now. Also, some of the datasets include numerical values after each gene name- skip those for now.

Task 3 Unfortunately, the data files are not in the “R-friendly” format. The biggest challenge in this bonus task is to read the file in and convert it to the usable data.frame. It should finally look like that:

term gene
term1 entrezID1
term2 entrezID2

Note that column names should be “term” and “gene”, and gene name should be ENTREZ ID.

  • Google and investigate read.table documentation, how to read tables with variable numbers of elements in each row into R.

    • Hint!: arguments like fill, na.strings, col.names are needed. Also the usage of command count.fields is mandatory.
  • If you manage to read the table into R, apply melt command from reshape2 package to convert data to long format, as you have previously learned in the R part. Install reshape2, if necessary.

  • Remove unnecessary column(s) and rows where gene name is “NA”. Make use of !is.na() for that.

  • Convert HGNC symbols to ENTREZ IDs, as we have done before. Do not forget to convert ENTREZ ID to character.

  • Merge the ENTREZ IDs to the intial table, manipulate the table to the format we need. Use the tricks you have learned today and on previous days for that.

Task 4 Run universal ORA, using differentially expressed genes from CD3 stimulation and your own gene set database. Put P-value and Q-value cutoffs to 1, to investigate top results even if none are significant.

Q: Did you find any significant and/or interpretable results?

Task 5 Run universal GSEA, using all ranked genes from CD3 stimulation and your own gene set database. Put P-value cutoff to 1, to investigate top results even if none are significant.

Q: Did you find any significant and/or interpretable results?

Task 6 Visualize your ORA and GSEA results using plots we have constructed today.

Convert the data into usable format.

# read in data

library(reshape2)
library(stringr)

database <- read.table('/Users/urmovosa/Downloads/TargetScan_microRNA.txt', 
             fill = T, 
             sep = '\t', 
             na.strings = '', 
             header = F, 
             col.names = paste0("V", 1:max(count.fields('/Users/urmovosa/Downloads/TargetScan_microRNA.txt', sep = '\t'))))

database <- database[, -2]
database <- melt(database, id.vars = 'V1')
database <- database[, -2]
colnames(database) <- c('term', 'gene')

database$gene <- str_replace(database$gene, ',.*', '')

entrez <- bitr(database$gene, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db")

database <- merge(database, entrez, by.x = 'gene', by.y = 'SYMBOL')
database <- database[, c(2:3)]
colnames(database)[2] <- 'gene'
database$gene <- as.character(database$gene)

Run ORA

# enrichment analysis

enricher_output <- enricher(cd3_sig$ENTREZID, pvalueCutoff = 1, qvalueCutoff = 1, pAdjustMethod = "BH", universe = cd3$ENTREZID,
  minGSSize = 10, maxGSSize = 500, TERM2GENE = database,
  TERM2NAME = NA)

Run GSEA

gsea_output <- GSEA(cd3_gsea, pvalueCutoff = 1, pAdjustMethod = "BH",
  minGSSize = 10, maxGSSize = 500, TERM2GENE = database,
  TERM2NAME = NA)

References

Quinn, Emma M., Ciara Coleman, Ben Molloy, Patricia Dominguez Castro, Paul Cormican, Valerie Trimble, Nasir Mahmud, and Ross McManus. 2015. “Transcriptome Analysis of Cd4+ T Cells in Coeliac Disease Reveals Imprint of Bach2 and IFN\(\gamma\) Regulation.” PLOS ONE 10 (10). Public Library of Science: e0140049. doi:10.1371/journal.pone.0140049.

Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50. doi:10.1073/pnas.0506580102.

Yu, Guangchuang, Li-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “clusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters.” OMICS 16 (5). 140 Huguenot Street, 3rd FloorNew Rochelle, NY 10801USA: Mary Ann Liebert, Inc.: 284–87. doi:10.1089/omi.2011.0118[PII].